1. INTRODUCTION


Advances  in  modern  computer  technology  have  led  to  an 
exponential  growth  in  the  accumulation  and  storage  of 
information.  This  is  especially  apparent  in  the  scientific 
and technical literature, where both the supply and demand for
information  are  rapidly  expanding.  DoD  scientists,  engineers, 
and  technicians,  as  a  result,  are  confronted  with  an 
overwhelming  abundance  of  information.  It  is  becoming 
increasingly  difficult  and  time-consuming  to  locate  and 
retrieve  relevant  information  from  the  growing  volume  of 
technical  literature  and  documentation.  Similarly,  it  is 
becoming  increasingly  difficult  and  time-consuming  to  revise 
or  update  information  in  large  text  and  graphics  databases. 

One  might  expect  that  the  same  technology  which  contributed 
to  these  problems  would  offer  solutions.  Indeed,  a  widespread 
transition  is  underway  from  paper-based  documents  to 
machine-readable text files, optical disks, and so forth, in
order  to  allow  on-line  authoring,  searching,  and  data 
modification.  Yet  current  computer-based  retrieval  systems 
perform  poorly  and  are  difficult  to  use.  Furthermore,  current 
systems  deliver  only  bibliographic  citations  and  abstracts,  or 
at  best  the  documents  themselves,  rather  than  retrieving 
information  contained  in  the  documents. 

Since  1978,  Hughes  Support  Systems  has  been  engaged  in  an 
effort  to  resolve  some  of  the  problems  associated  with  on-line 
technical  documentation.  In  1982  this  effort  began  to  focus 
on  problems  of  information  retrieval  using  artificial 
intelligence  (AI)  techniques  to  supplant  inadequate 
conventional approaches. An IR&D project, known as
Associative Loop Memory or ALOOF, was undertaken with the aim
of developing a software/hardware system to extract and manage
a database of index terms derived from natural language
documents.  This  retrieval  system  provides  the  user  with  a 
graphical  display  which  can  be  browsed  and  manipulated  to 
retrieve  knowledge  from  an  on-line  documents  database.  An 
overview of this approach is depicted in figure 1.


Landauer, T.K., Dumais, S.T., Gomez, L.M., and Furnas,

Figure 1. Overview of a system for automated indexing.


The  term  "knowledge  engineering"  refers  to  the  procedures 
used by a trained user (the knowledge engineer) to manually
insert  knowledge,  correct  errors,  or  resolve  ambiguities  in 
the  knowledge  base.  Recent  efforts  on  the  ALOOF  project  have 
been  directed  toward  the  production  of  software  for  natural 
language  understanding  and  the  addition  of  knowledge 
engineering tools. Although mutually reinforcing, the IR&D
effort  and  the  experiments  performed  under  the  present 
contract  do  not  overlap. 


1.1 Contract objectives

Under a previous contract, which was monitored by the Army
Research  Institute,  we  developed  a  knowledge  representation 
scheme  for  indexing  technical  text  and  procedures  for 
inferring  semantic  relations  between  index  terms. 

The  aim  of  the  present  contract  was  to  further  investigate 
concepts  related  to  the  development  of  intelligent,  automated 
text  indexing.  Our  objectives  were  (1)  to  construct  knowledge 
bases  of  index  terms  and  semantic  relations  extracted  from 
full  text  documents  and  (2)  to  investigate  approaches  to 
automating  this  extraction  process  including  work  involving 
semantic  knowledge  representation,  knowledge  engineering, 
natural language understanding, inferencing, and methods for
handling  uncertainty.  These  objectives  were  met  through  a 
number  of  experiments  we  conducted  with  small  text  databases. 


1.2 Environment for conducting experiments

The  experiments  performed  under  this  contract  used  the 
Interlisp-D  language  running  on  a  Xerox  1100  AI  workstation 
residing  in  our  laboratory.  Some  of  the  experiments  used 
proprietary  software  developed  under  the  Hughes  ALOOF  project. 
The  Xerox  Lisp  workstation  is  specially  designed  for  AI 
developmental  work  and  supports  high-resolution  bit-mapped 
graphics  with  windows  and  menus,  uses  a  mouse  for  interacting 
with  the  screen,  and  has  29  megabytes  of  local  disk  storage. 
Our  workstation  is  attached  to  a  DEC  VAX  11/780  through  an 
Ethernet  connection  for  remote  file  storage  and  access  to 
textfiles,  editors,  and  output  devices. 


1.3 Status of information retrieval

The  traditional  objective  of  technical  libraries  and  of 
most  other  information  retrieval  efforts  has  been  to  deliver 
relevant  documents  or  pointers  to  documents  (citations)  along 
with short summaries. The training of retrieval specialists
and the process of abstracting and cataloguing information
for these systems are difficult, expensive, and
time-consuming. Moreover, the tasks of assessing the actual
relevance of these documents and of finding and extracting the
relevant information have been left to the user.

With  the  advent  of  inexpensive  mass  storage  devices,  it  is 
becoming  increasingly  feasible  to  store  the  full  text  of 
documents  in  on-line  databases.  The  development  of  devices 
for the rapid input of text, such as word processors and
optical  scanners,  has  also  contributed  to  the  rising 
popularity  of  full-text  databases.  These  advances  make  it 
considerably  easier  to  physically  store  and  retrieve  text  and 
graphics,  yet  the  problem  of  identifying  relevant  knowledge  in 
the database remains.

Not  uncommonly,  the  typical  end  user  of  an  on-line  database 
collection  may  have  only  a  vague  idea  of  what  he  or  she  is 
seeking  and  cannot  provide  a  precise  specification  for 
conducting  a  search.  Even  a  well-defined  query,  however,  must 
be  translated  into  a  more  constrained  and  often  more  ambiguous 
format  for  input  to  the  system,  usually  by  a  trained  human 
intermediary. Indeed, a significant area of research in
information retrieval is concerned with automating the query
translation and narrowing operations. Another similar
approach constructs relational thesauri to improve retrieval
performance in response to user queries.

Current  on-line  information  retrieval  systems  illustrate 
these  problems.  One  common  approach  is  to  present  to  the  user 
a  nested  menu  display  which  contains  subject  headings  similar 
to  those  found  in  a  traditional  subject  index  of  a  printed 
document. However, not only is this type of index expensive
and  time-consuming  to  produce,  but  it  is  again  of  little  use 
to  a  user  who  does  not  know  or  cannot  find  the  subject  heading 
for  the  information  she  or  he  is  seeking. 

Another  common  retrieval  approach  uses  keyword  query 
techniques.  These  systems  require  sophisticated  search 
strategies  involving  the  selection  of  appropriate  index  terms 
and  Boolean  operators.  As  a  result,  most  searches  are 
conducted  by  highly  trained  intermediaries  rather  than  by 
untrained  end  users.  Furthermore,  it  is  often  difficult  to 
narrow  a  search  with  these  methods;  a  given  query  may  retrieve 
either  hundreds  of  documents  or  none.  Query  systems  focus  on 
only  two  of  the  many  possible  retrieval  cues  found  in  full 
text  documents--the  frequency  and  proximity  of  word 
occurrences.  Therefore,  the  response  to  a  word  like 
"terminal"  may  include  instances  ranging  from  "computer 
terminal"  to  "terminal  disease"  to  "airline  terminal" 
constrained  only  by  the  breadth  of  the  database  (and  by 
subsequent  Boolean  quantifiers). 
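
The ambiguity of keyword matching can be illustrated with a minimal sketch in Python. The three-document collection and the names `build_inverted_index` and `boolean_and` are hypothetical, invented here for illustration; they do not describe any particular retrieval system.

```python
import re

# Hypothetical toy collection; the texts are invented for illustration.
DOCS = {
    "d1": "Connect the computer terminal to the host before logging in.",
    "d2": "Hospice care for patients with a terminal disease.",
    "d3": "Passengers should proceed to the airline terminal.",
}


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index.setdefault(word, set()).add(doc_id)
    return index


def boolean_and(index, *terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


INDEX = build_inverted_index(DOCS)
print(sorted(boolean_and(INDEX, "terminal")))              # all three senses match
print(sorted(boolean_and(INDEX, "terminal", "computer")))  # narrowed to one document
print(sorted(boolean_and(INDEX, "terminal", "printer")))   # narrowed to nothing
```

The single keyword matches every sense of the word, while adding a second term either narrows the result sharply or empties it, which is the behavior described above.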


2.  AN  AI  APPROACH  TO  INFORMATION  RETRIEVAL 

We can perhaps best define our AI-based approach to the
information retrieval problem by contrasting it with the
traditional systems described above. We intend to develop a
system that does not require exact knowledge of the objective
of a search but can be used by an unsophisticated user to
rapidly retrieve relevant sections from a large on-line
collection of full text documents. We assume that the user
will recognize the information when retrieved. The system
will allow the user to quickly judge relevance by delivering
specific passages of the text rather than entire documents.
Alternative search paths will be provided if a search fails to
retrieve the desired information. Further, the system will be
dynamic: over time, it will improve its ability to process
text and to retrieve relevant information by making use of
knowledge supplied by a knowledge engineer, information from
text it has previously read, and interactions with end users.
Finally, the system will extract index terms and semantic
relations from the text automatically in order to overcome the
bottleneck of knowledge acquisition which plagues most current
AI systems. In brief, our objective is to present the user
with an easily understood graphics interface which can be used
to intelligently query and browse through a massive text
database.

Wang, Y., Vandendorpe, J., and Evens, M. "Relational
thesauri in information retrieval." Journal of the American


2.1 Semantic network as a browsing index

Semantic  networks  are  a  common  technique  for  representing 
knowledge in AI systems. Originally proposed as a model of
human  associative  memory,  they  have  come  to  be  regarded  as 
convenient  structures  for  representing  certain  kinds  of 
knowledge  and  for  organizing  this  knowledge  in  a  way  that  is 
convenient for making inferences. Semantic networks represent
knowledge as nodes linked together by semantic
associations. Associative memory recall can then be modeled
as the traversal of these links.

Figure  2  depicts  a  small  portion  of  a  sample  semantic 
network  used  for  browsing  text.  Notice  that  this  network  is 
structured  so  that  vertical  links  represent  a  hierarchy  of 
information  from  general  to  specific  (e.g.,  VALVE  type 
REGULATOR)  while  horizontal  links  represent 
agent-action-object  (PLUMBER  do  SHUT  doto  VALVE)  or 
object-attribute-value  relations.  For  the  net  to  be  used  as  a 
text  index,  we  attach  pointers  to  the  semantic  links 
corresponding  to  sections  of  text  that  contain  the  information 
underlying  the  semantic  relations  represented  in  the  network. 
The user, by selecting two linked index terms with a mouse or
other input device, can retrieve the text passage
corresponding to the pointer. This approach results in a very
browsable index ordered by the natural relations between index
terms, rather than by the arbitrary alphabetical ordering
found in standard printed indices.

Quillian, M.R. "Semantic memory." In M. Minsky (Ed.),
Semantic Information Processing. MIT Press, Cambridge, Mass.,

When water breaks loose, the first thing to do is shut
it off at the source. If you have time to think, pick the
valve nearest the leak. But if things are moving too fast,
head first for the gate valve and shut down the whole system
while you plot the rest of your attack.

These are typical valves. Know where to find the ones
in your house.


Figure 2. Semantic network as a browsing index;
selecting a link in the semantic network
retrieves a corresponding full text passage
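
A browsing index of this kind can be sketched as a small data structure. The Python below is only an illustration under stated assumptions: the class `SemanticIndex`, its method names, the passage ids, and the choice of which passage is attached to which link are all hypothetical. Only the relations (VALVE type REGULATOR, PLUMBER do SHUT doto VALVE) and the passage wording come from the text above.

```python
from collections import defaultdict


class SemanticIndex:
    """Semantic network whose links carry pointers to text passages."""

    def __init__(self):
        self.passages = {}              # passage id -> passage text
        self.links = defaultdict(list)  # (term, relation, term) -> passage ids

    def add_passage(self, passage_id, text):
        self.passages[passage_id] = text

    def add_link(self, source, relation, target, passage_id):
        """Store a semantic link and attach a pointer to its source passage."""
        self.links[(source, relation, target)].append(passage_id)

    def neighbors(self, term):
        """Return (relation, other term) pairs for links touching a term."""
        found = []
        for source, relation, target in self.links:
            if source == term:
                found.append((relation, target))
            elif target == term:
                found.append((relation, source))
        return sorted(found)

    def retrieve(self, term_a, term_b):
        """Return the passages attached to links joining the two terms."""
        hits = []
        for (source, _relation, target), ids in self.links.items():
            if {source, target} == {term_a, term_b}:
                hits.extend(self.passages[i] for i in ids)
        return hits


# Hypothetical assignment of passages to links, for illustration only.
index = SemanticIndex()
index.add_passage("p1", "If you have time to think, pick the valve nearest the leak.")
index.add_passage("p2", "Head first for the gate valve and shut down the whole system.")
index.add_link("VALVE", "type", "REGULATOR", "p1")
index.add_link("PLUMBER", "do", "SHUT", "p2")
index.add_link("SHUT", "doto", "VALVE", "p2")

print(index.neighbors("VALVE"))          # terms the user can browse to from VALVE
print(index.retrieve("SHUT", "VALVE"))   # passage behind the selected link
```

Browsing corresponds to repeated calls to `neighbors`, and selecting two linked terms corresponds to `retrieve`.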

2.2 Semantic inferencing

While  a  traditional  index  is  a  very  simple  representation 
of the knowledge in a text, a semantic network representation
is much more structured and useful for performing inferences.

A  semantic  network  typically  employs  a  set  of  semantic 
primitives  which  specify  the  types  of  relations  which  can  be 
represented in the system. Our system uses a set of 16
primitives (type, inst[ance], part, do, doto, has, is, and
mod[ifies] and their negations) to capture relationships
involving class membership, part/whole relations, actions, and
properties.  Each  node  taken  in  conjunction  with  its  relations 
can  be  thought  of  as  a  concept  or  a  frame  (another  common  form 
of  AI  knowledge  representation). 

The  vertical  membership  relations  in  a  semantic  network 
permit  the  system  to  make  logical  inferences  using  the 
knowledge  it  has  stored.  For  example,  if  a  network  represents 
the  relations  that  a  robin  is  a  type  of  bird  and  that  all 
birds  have  feathers  and  can  fly,  then  the  system  can  infer 
that  a  robin  has  feathers  and  can  fly.  Similarly,  from  the 
relations  that  a  plumber  can  repair  any  plumbing  device  and  a 
faucet  is  a  plumbing  device,  the  system  can  infer  that  a 
plumber  can  repair  a  faucet.  This  inferencing  ability  permits 
a  more  efficient  knowledge  representation,  in  which  attributes 
common  to  all  members  of  a  class  can  be  attached  to  the  class 
rather  than  being  duplicated  for  each  member.  Further,  the 
system  can  use  inferences  during  the  processing  of  text  to 
understand  new  words  and  to  relate  knowledge  from  the  text 
to  knowledge  already  stored  in  the  system. 
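
The inheritance inference described above can be sketched in a few lines of Python. This is an illustration, not the inference procedure of the system itself. It assumes that relations are stored as (term, primitive, term) triples, that `type` links run from the general term to the specific one (as in VALVE type REGULATOR), and that a stored relation holds for every member of the classes it mentions. The function names are hypothetical.

```python
# Relation triples (term, primitive, term); `type` runs general -> specific.
FACTS = {
    ("BIRD", "type", "ROBIN"),
    ("BIRD", "has", "FEATHERS"),
    ("BIRD", "do", "FLY"),
    ("PLUMBING-DEVICE", "type", "FAUCET"),
    ("PLUMBER", "do", "REPAIR"),
    ("REPAIR", "doto", "PLUMBING-DEVICE"),
}


def ancestors(facts, node):
    """Return every class above `node`, following `type` links upward."""
    seen = set()
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for general, relation, specific in facts:
            if relation == "type" and specific == current and general not in seen:
                seen.add(general)
                frontier.append(general)
    return seen


def holds(facts, subject, relation, value):
    """True if the relation is stored directly or can be inherited.

    Assumption: a relation stored on a class applies to all of its members,
    on both the subject side and the value side.
    """
    subjects = {subject} | ancestors(facts, subject)
    values = {value} | ancestors(facts, value)
    return any((s, relation, v) in facts for s in subjects for v in values)


print(holds(FACTS, "ROBIN", "has", "FEATHERS"))   # inherited from BIRD
print(holds(FACTS, "REPAIR", "doto", "FAUCET"))   # FAUCET is a PLUMBING-DEVICE
print(holds(FACTS, "FAUCET", "has", "FEATHERS"))  # no supporting relation
```

Because FEATHERS and FLY are attached once to BIRD, no triple needs to be stored for ROBIN itself, which is the economy of representation noted above.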


2.3 ALOOF hardware

The  semantic  networks  constructed  to  index  and  represent 
the  knowledge  in  a  large  document  collection  will  obviously 
need  to  be  very  large.  Real-world  databases  typically  contain 
thousands of documents pertaining to a large variety of
subjects. Conventional serial computers may not be able to
perform searches of these networks rapidly enough to provide a
reasonable response time. To cope with this problem, we are
also exploring, as part of our IR&D effort, the implementation
of the ALOOF concept as special-purpose, parallel-processing
hardware.

The  large  semantic  network  would  be  segmented  and  stored  in 
a number of independent processing nodes, each of which would
contain  a  microprocessor  and  local  memory  (fig.  3).  Searches 
could  then  be  performed  in  parallel,  simultaneously  in  each  of 
these  nodes,  rather  than  by  serial  searching  of  the  entire 
network.  A  small  working  hardware  prototype  was  developed  in 
1982,  and  we  anticipate  a  further  hardware  effort  as  we  begin 
to  work  with  larger  databases. 
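
The segmentation idea can be mimicked in software. The sketch below is a loose analogy only: it uses Python worker threads in place of independent microprocessor nodes, a round-robin partition that is our assumption, and hypothetical function names. It shows the structure of a segmented search, not the performance of parallel hardware.

```python
from concurrent.futures import ThreadPoolExecutor

# A few links in (term, relation, term) form, for illustration.
LINKS = [
    ("VALVE", "type", "REGULATOR"),
    ("PLUMBER", "do", "SHUT"),
    ("SHUT", "doto", "VALVE"),
    ("BIRD", "type", "ROBIN"),
]


def partition(links, n_nodes):
    """Split the network into n_nodes segments, round-robin."""
    return [links[i::n_nodes] for i in range(n_nodes)]


def search_segment(segment, term):
    """Search one segment for links that mention the term."""
    return [link for link in segment if term in (link[0], link[2])]


def parallel_search(segments, term):
    """Search every segment concurrently and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partial = pool.map(search_segment, segments, [term] * len(segments))
        return sorted(hit for hits in partial for hit in hits)


segments = partition(LINKS, 2)
print(parallel_search(segments, "VALVE"))
```

Each worker examines only its own segment, so the merged result is the same as a serial scan of the whole network.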


3.  CREATION  OF  KNOWLEDGE  BASES 

One  of  our  objectives  under  the  present  contract  was  to 
develop  and  study  knowledge  bases  of  index  terms  and  semantic 
relations  extracted  from  different  text  domains.  Three 
general  domains  were  chosen  for  analysis:  research  abstracts, 
maintenance  manuals,  and  encyclopedic  texts.  Knowledge  bases 
were  created  for  these  domains  using  representative 
documents--for the abstracts collection, DoD form 1498s
pertaining to training; for the maintenance domain, a depot
maintenance work requirement of a TOW weapons system and
sections of a home maintenance manual; and for the general
text, a short segment from a geology textbook. This latter
text was chosen to allow comparison of our approach with a
published study of human indexing performance (sect. 4.1
below).

The process of creating knowledge bases was very different
for  these  domains.  Creating  a  knowledge  base  from  the 
abstracts  collection  proved  to  be  most  difficult.  This 
difficulty  apparently  stems  in  part  from  the  fact  that 
abstracts  are  abbreviated  texts  rather  than  full  text 
documents. The language used in abstracts is condensed and
unnatural; there is a tendency to use longer words, more
complicated noun phrases (e.g., an engagement simulation-based
army  training  evaluation  program)  and  long,  convoluted,  and 
incomplete  sentences.  Even  humans  often  have  difficulty 
understanding  these  abstracts.  Some  examples  are  presented  in 
figure  4. 

Even  more  problematic  from  the  standpoint  of  indexing  is 
the  absence  of  terms  which  can  be  used  to  identify  the  content 
of  the  abstracts.  The  research  summaries  tend  to  be  written 
in  a  very  general  and  abstract  language.  Yet  another  problem 
with  this  domain  was  the  use  of  a  standardized  format  which 
often  caused  the  abstracts  to  read  alike. 

The  language  found  in  maintenance  manuals  is  also 
constrained and unlike that generally encountered in text. By
necessity  the  information  contained  in  these  documents  is 
highly  process-oriented  and  presented  in  terms  of  symptoms  and 
corrective  actions  or  as  specific  sequential  instructions. 
These  documents  also  tend  to  use  domain-specific  terminology 
and  to  emphasize  spatial  relations  (i.e.,  where  to  find  a  part 
in  relation  to  other  parts)  which  are  not  represented  in  our 
semantic  primitives.  Another  problem  is  the  reliance  upon 
tables  and  figures  to  convey  information,  which  cannot  be 
captured  by  straightforward  text  analysis  procedures. 

The  general  expository  text  was  most  readily  processed  by 
our  approach.  However,  the  effort  with  this  text  also 
highlighted  the  need  to  incorporate  very  low  frequency  terms, 
to  handle  modifier-noun  phrases,  and  to  infer  relations 
involving terms which may not be found in the text. More
discussion  of  this  text  is  presented  in  section  4.1. 

In  general,  we  concluded  from  our  experience  with  these 
texts  that  initial  generic  and  domain-specific  knowledge  was 
needed  to  accurately  process  and  represent  the  texts. 
Appropriate general knowledge of objects, actions, and
attributes, as well as knowledge of the important concepts and
relations  of  a  specific  domain  can  be  provided  by  a  knowledge 
engineer  to  greatly  facilitate  the  analysis  of  a  text  for 
index  terms  and  semantic  relations.  We  refer  to  this  initial 
background  knowledge  as  "seed  knowledge."  The  construction  of 
hierarchically  ordered  networks  for  browsing  through  the 
database  is  critically  dependent  upon  the  scope  and  accuracy 
of  the  seed  knowledge.  We  also  estimated  the  optimal  length 
of  a  text  sample  for  indexing  to  be  on  the  order  of  2,000  to 
10,000  words. 


Create  a  methodology  for  the  development 
of  "relevance  values"  to  career 
enhancement  and  for  the  definition  of 
functional  overlap  of  separate 
external-to-specialty  assignment  options 
for  the  career  progression  demands  of  the 
individual  officer  in  his  own  assigned 
specialties;  develop  an  assignment 
algorithm  methodology. 

Evaluate  the  potential  of  quasi-algorithm 
methods  and  techniques  for  specifying 
objective  job/task  descriptions  and 
performance  requirements  that  are  related 
to  job  structures,  work  requirements, 
training,  and  personnel  management. 

Systematically  investigate  organizational 
effectiveness  (OE)  as  a  process,  including 
principals  within  it  and  the  nature  of 
their  behavioral  dynamics;  develop  an 
operational  description  of  the  conditions 
for  and  dynamics  of  the  OE  process. 

Improve  individual  and  unit  proficiency  of 
army  personnel  in  selected  military 
systems  by  developing  guidelines  and 
recommendations  for  the  effective  transfer 
of  training  technology  to  the  army  user. 

Develop  a  productivity  measurement  system 
for  use  in  civilian  personnel  management 
to  provide  input  into  training  for 
civilian  and  military  managers  on  the 
relationship  between  personnel  management 
practices and mission accomplishment.


Figure 4. Excerpts from abstracts database of form 1498s

From our experiments, we can begin to characterize what
information  the  seed  knowledge  should  contain.  Most  important 
for  organizing  the  knowledge  base  is  an  initial  set  of  very 
abstract  terms  pertaining  to  the  domain  or  to  general 
concepts.  These  terms  may  rarely  be  encountered  in  text  but 
they  provide  a  taxonomy  of  objects  which  is  useful  for 
browsing  and  for  making  inferences.  Ideally,  this  taxonomy 
should  be  carried  down  to  the  level  of  the  major  keywords 
which  do  occur  in  the  text.  An  example  for  the  home  plumbing 
repair text is depicted in figure 5. A knowledge engineer
does not necessarily need to provide extensive seed knowledge
before a text is analyzed, but can examine the semantic
networks created after the text has been processed and
iteratively build up this knowledge. Also, once this
hierarchy  of  seed  knowledge  has  been  built  up  for  a  given 
domain,  it  should  be  applicable  to  additional  documents  from 
that  domain.  The  text  analysis  procedures  should  thus  become 
increasingly  effective  and  require  less  intervention. 


3.2 Identification of index terms

When  a  text  is  viewed  simply  as  symbol  strings,  a  number 
of  cues  become  apparent  which  might  help  to  identify  the  key 
terms  and  their  meaning.  These  include: 

    document cues--headings and paragraph breaks
    sentence breaks and punctuation
    sentence length
    word frequency
    word length
    word position within a sentence, paragraph, or document
    word order and interproximities--phrase groupings
    phrase frequency and length
    word morphology--prefixes and suffixes

Higher level semantic cues include:

    word function--syntax
    word meaning--semantics
    context or theme

The  above  cues  are  roughly  ordered  from  rather  weak,  global 
cues  to  those  which  are  stronger  and  more  specific.  The 
former  are  more  concrete  and  relatively  easy  to  compute, 
whereas  the  latter  are  abstract,  computationally  difficult  or 
costly,  often  requiring  external  knowledge. 


Figure 5. Hierarchy of seed knowledge for the plumbing domain.


For  the  purpose  of  representing  the  key  knowledge  in  a  text 
and  creating  an  index  of  key  words  and  semantic  relations,  we 
do not need to have an in-depth understanding of the text.
Rather,  it  is  sufficient  to  obtain  a  limited  understanding 
which  focuses  on  key  words  and  concepts.  Important  concepts 
in  terms  of  indexing  are  likely  to  appear  in  a  context  which 
is  easy  to  understand  and  to  occur  with  some  frequency  so  that 
difficult  passages  can  be  ignored.  The  following  is  an 
overview  of  the  approach  we  used  in  creating  the  knowledge 
bases.


3.3 Word frequency and semantics

From an examination of the most frequent words found in
studies of English text, it can be seen that words that
occur  with  high  frequency  generally  do  not  convey  much  meaning 
(although  they  are  important  syntactic  markers).  Figure  6, 
for  example,  depicts  the  50  most  frequent  English  words  from 
the  Kucera-Francis  study.  For  purposes  of  indexing,  these 
words  often  are  and  should  be  ignored.  This  simple  fact  is 
very  important  because  it  greatly  reduces  the  problem  of 
indexing.  As  an  illustration,  the  most  frequent  100  English 
words  account  for  nearly  half  of  all  of  the  words  encountered 
in  a  typical  text.  Furthermore,  it  is  quite  easy  to  automate 
this  task  of  omitting  highly  frequent  words  from  an  index  by 
use of an exclusion list ("stoplist").
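
Stoplist filtering is simple enough to show in full. The Python sketch below is illustrative only: the stoplist is a short hand-picked set of very common English function words rather than the list used in our experiments, and the function name `candidate_terms` is hypothetical.

```python
import re

# A short illustrative stoplist of very common function words (an assumption;
# a working list would be drawn from a published frequency study).
STOPLIST = {
    "the", "of", "and", "to", "a", "in", "that", "is", "was", "he",
    "for", "it", "with", "as", "his", "on", "be", "at", "by", "i",
}


def candidate_terms(text, stoplist=STOPLIST):
    """Return the words of a text, in order, with stoplist words removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in stoplist]


sentence = "When water breaks loose, the first thing to do is shut it off at the source."
print(candidate_terms(sentence))
```

On this sixteen-word sentence, six words are dropped and ten remain as candidate index terms.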

A corollary of the inverse relationship between frequency
and  semantics  is  that  words  which  occur  very  infrequently  may 
be  critical  for  understanding  the  meaning  of  a  sentence. 

Thus,  although  there  are  a  limited  number  of  articles, 
conjunctions,  or  prepositions  in  English,  there  is  a  profusion 
of  nouns  and  verbs  which  are  used  to  express  subtle  shades  of 
meaning  or  domain-specific  terminology.  The  manual  encoding 
of  the  semantics  of  these  words  is  costly  and  laborious  and 
this  is  compounded  by  their  large  number  and  relatively  low 
frequency.  These  are  the  primary  terms  we  seek  to  identify 
and  use  for  indexing. 


3.4 Frequency experiments

A word-frequency analysis of the databases confirmed the
validity of the use of a stoplist of frequent terms with low
semantic content. For example, the most frequent words in
each of the domains showed a close correspondence with the
Kucera-Francis list. Moreover, many high frequency words
which were not on this list were indicative of the domain
(e.g., research, training, plumbing, valve, lava, etc.). We
can conclude that the stoplist significantly reduces the
indexing task and that the remaining frequent words are often
key concepts in a domain. This is an important point for
attempts to provide seed knowledge for a domain. However,
given that frequently occurring words will have more
opportunities to be extracted during the processing of a text,
there may be no need to explicitly monitor word frequency.

Figure 6. Most frequently occurring English words
in printed text.

The  frequency  analyses  also  revealed  that  nouns  occurred 
with a much greater frequency than verbs. Moreover, each
domain tended to use a given subset of verbs which thus
provide another key to understanding the text within that
domain.

In  an  additional  frequency  study,  we  attempted  to  identify 
common  noun  phrases  (e.g.,  utility  company  serviceman)  by 
counting co-occurrences of nonstoplist words. Many of the
important  indexable  phrases  were  identified  by  this  method. 
However,  the  computational  expense  of  maintaining  these 
frequency  counts  was  too  high.  It  appears  that  some  other  way 
of  incorporating  these  phrases  as  index  terms  is  needed. 
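
One way to picture such a co-occurrence count is sketched below in Python. It is not the procedure used in our experiments: it counts only directly adjacent pairs of nonstoplist words within a sentence, the stoplist and sample text are invented, and the function name is hypothetical.

```python
import re
from collections import Counter

# Small illustrative stoplist (an assumption, not a working list).
STOPLIST = {"the", "a", "of", "to", "and", "is", "in", "for", "if", "your"}

# Invented sample text built around the phrase cited above.
SAMPLE = (
    "Call the utility company serviceman if the meter leaks. "
    "The utility company serviceman will shut the gate valve. "
    "Find the gate valve in your house."
)


def adjacent_pairs(text, stoplist=STOPLIST):
    """Count adjacent pairs of nonstoplist words within each sentence."""
    counts = Counter()
    for sentence in re.split(r"[.!?;]", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        for first, second in zip(words, words[1:]):
            if first not in stoplist and second not in stoplist:
                counts[(first, second)] += 1
    return counts


counts = adjacent_pairs(SAMPLE)
repeated = sorted(pair for pair, n in counts.items() if n >= 2)
print(repeated)
```

The table holds one counter for every distinct pair encountered, most of which occur only once, so it grows much faster than a table of single-word counts. This suggests why maintaining such counts becomes expensive on larger texts.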


3.5 Word stemming

Keyword  indexing  systems  typically  remove  the  suffixes  of 
words,  thus  grouping  together  similar  word  forms.  This 
allows  a  more  compact  index  and  permits  a  single  query  to 
retrieve  all  the  various  word  forms.  A  problem  arises, 
however,  in  that  stemmed  words  are  more  ambiguous  (e.g., 
calculat-ion  and  calculat-or  become  synonymous).  Our 
experience  with  the  frequency  experiments  and  knowledge  base 
creation  process  suggested  that  some  stemming  was  necessary  to 
avoid  duplication  of  index  terms.  However,  we  concluded  that 
stemming  could  be  limited  to  conversion  of  plural  nouns  to 
their  singular  forms  and  to  conversion  of  marked  verb  forms 
(verbs ending in "ing," "ed," or "s," and those formed
irregularly)  to  unmarked  forms. 
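
A limited stemmer of this kind might look like the following Python sketch. The rules and the table of irregular forms are our own crude assumptions for illustration (for example, a final "e" dropped before "ing" or "ed" is not restored), and the names `limited_stem` and `IRREGULAR` are hypothetical. Derivational suffixes such as "ion" and "or" are deliberately left alone.

```python
# Hypothetical table of irregular forms; a working table would be much larger.
IRREGULAR = {"is": "be", "was": "be", "were": "be", "found": "find",
             "went": "go", "made": "make", "men": "man", "feet": "foot"}


def _undouble(stem):
    """Drop one letter of a doubled final consonant (shutt -> shut)."""
    if len(stem) >= 3 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
        return stem[:-1]
    return stem


def limited_stem(word):
    """Reduce plurals and marked verb forms only; leave other suffixes alone."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]
    if word.endswith("s") and not word.endswith(("ss", "us", "is")) and len(word) > 3:
        return word[:-1]
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and not word.endswith("eed"):
            stem = word[: -len(suffix)]
            # Require a plausible stem so that words like "string" are kept whole.
            if len(stem) >= 3 and any(c in "aeiou" for c in stem):
                return _undouble(stem)
    return word


for w in ["valves", "wrenches", "leaking", "shutting", "repaired",
          "found", "calculation", "calculator"]:
    print(w, "->", limited_stem(w))
```

Under these rules "calculation" and "calculator" remain distinct index terms, while the plural and marked verb forms collapse to a single entry.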

4. TEXT ANALYSIS


Most  language  understanding  efforts  use  a  large  dictionary 
or lexicon (on the order of 30,000 words or more). Yet these
systems  often  encounter  words  which  are  not  represented  in  the 
lexicon,  especially  within  domains  which  use  special 
terminology.  No  fixed  set  of  words  and  word  usages  can  ever 
completely  encompass  a  natural  language  because  of  the  dynamic 
and  flexible  nature  of  natural  languages. 

Our  approach  to  text  analysis  is  robust  because  it  avoids 
the  use  of  such  a  large  lexicon.  We  attempt  to  provide  an 
exhaustive  knowledge  of  articles,  conjunctions,  copulas, 
functionals,  prepositions,  and  pronouns,  largely  corresponding 
to  the  most  frequent  English  words.  The  identification  of 
adjectives,  adverbs,  nouns,  and  verbs,  however,  is  left  to  the 
language  processor.  This,  of  course,  requires  the  ability  to 
appropriately  process  new  words  for  which  no  prior  knowledge 
has  been  stored.  However,  such  a  system  can  build  up  a 
lexicon  as  it  reads  text  and  so  naturally  customizes  the 
lexicon  to  the  text  domain. 

Our  language  processor  uses  word  order,  prepositions, 
knowledge  of  selected  verbs,  and  word  endings  to  identify  the 
parts  of  speech  and  to  infer  the  semantic  relations  among 
modifiers,  objects,  actions,  and  attributes.  For  example,  it 
infers  that  any  word  which  immediately  follows  an  article  must 
be  an  adjective  or  a  noun,  that  a  word  ending  in  "ly"  is  most 
likely  an  adverb,  that  a  noun  followed  by  a  verb  should  be 
represented by a do relation, and so forth. In addition,
certain  key  phrases  which  directly  express  important  semantic 
relations (e.g., "a <noun phrase> is a type of <noun phrase>"
or "a <noun phrase> is composed of <noun phrase>") are
specially  processed. 
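
Two of these heuristics can be sketched briefly in Python. This is an illustration under our own assumptions, not the language processor itself (which was written in Interlisp-D): the regular expressions, the function names, and the mapping of "is composed of" onto the part primitive are hypothetical, and only single noun phrases are handled.

```python
import re

ARTICLES = {"a", "an", "the"}

# Key phrases that state a relation directly. `type` links are stored
# general -> specific, following the VALVE type REGULATOR convention.
TYPE_PATTERN = re.compile(
    r"\b(?:a|an|the)\s+([a-z ]+?)\s+is\s+a\s+type\s+of\s+([a-z ]+?)\s*[.,;]",
    re.IGNORECASE,
)
PART_PATTERN = re.compile(
    r"\b(?:a|an|the)\s+([a-z ]+?)\s+is\s+composed\s+of\s+"
    r"(?:(?:a|an|the)\s+)?([a-z ]+?)\s*[.,;]",
    re.IGNORECASE,
)


def guess_class(previous, word):
    """Guess a word class from its ending and the preceding word."""
    if word in ARTICLES:
        return "article"
    if word.endswith("ly"):
        return "adverb"
    if previous in ARTICLES:
        return "adjective-or-noun"
    return "unknown"


def key_phrase_relations(sentence):
    """Extract relations from sentences that state them explicitly."""
    relations = []
    for specific, general in TYPE_PATTERN.findall(sentence):
        relations.append((general.upper(), "type", specific.upper()))
    for whole, part in PART_PATTERN.findall(sentence):
        relations.append((whole.upper(), "part", part.upper()))
    return relations


print(guess_class("the", "valve"))
print(guess_class("shut", "quickly"))
print(key_phrase_relations("A gate valve is a type of valve."))
print(key_phrase_relations("A faucet is composed of a stem."))
```

No lexicon entry for "valve", "faucet", or "stem" is needed: the closed-class words around them carry the analysis.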


4.1 Comparison with traditional indexing

In  order  to  evaluate  the  promise  of  an  AI  approach  to 
indexing,  we  wanted  to  contrast  our  approach  with  expert  human 
indexing  performance.  However,  little  has  been  published 
regarding  the  performance  of  human  indexers.  We  could  find  no 
standards  for  evaluating  current  indexes,  few  published 
research  reports,  and  only  a  few  standard  guidelines  used  by 
trained  indexers. 

One  research  report,  however,  attempted  to  contrast  the 
performance of a number of trained indexers working with the
same short segment of text. The sample text from this
experiment is depicted in figure 7 along with sections of the
actual book index which referred to this text. The major
conclusions of the study can be briefly summarized here.
Seventeen  human  indexers  selected  a  total  of  37  different 
index  terms.  The  indexers  chose  16  different  phrases  (e.g., 
silica  rich  lavas)  as  index  terms.  These  included  terms  which 
occurred  very  infrequently  or  not  at  all  in  the  text 
(volcano). The frequency of the selection of a few key terms
by  many  of  the  indexers  did  indicate  some  agreement.  However, 
for  the  most  part,  these  results,  like  the  book  index,  showed 
human  indexing  to  be  highly  idiosyncratic. 

We attempted to develop a semantic net index for this same
text by semiautomatically applying the text analysis
techniques described above. The resulting display is shown in
figure 8. It can be readily seen that our approach requires
much less information to capture most of the references
incorporated in the indexes created by hand. Furthermore, the
semantic network is a more organized representation of the
knowledge contained in the text. A reference to volcano was
included as a seed relation. We are encouraged by this
validation of our approach.

4.2 Verb knowledge

Case grammars parse a sentence on the basis of functional
roles rather than parts of speech. These grammars
incorporate knowledge about key verbs to restrict the
candidate words which can fill the roles of agent, object,
instrument, and so forth for a given verb. These cases may be
considered equivalent to the slots of a frame
representation.

The temperature of freshly erupted lava is rarely much above the melting-point,
and according to the composition and gas content it may range from 600 to
1200°C, the basic lavas, like basalt, being generally the hottest. The mobility of
molten lava depends on the same factors. Silica-rich lavas are usually stiff and
viscous and congeal as thick tongues before they have travelled far, whereas basic
lavas tend to flow freely for long distances, even down gentle slopes, before they
come to rest. The speed of a lava stream depends on the mobility and slope, and
may, quite locally, reach 60 miles an hour. But such speeds are very rarely
attained; even 10 miles an hour is unusual and often the movement is sluggish. In
recent years, when approaching flows have threatened villages on the slopes of
Mauna Loa and Etna, the danger has been averted by bombing from aeroplanes,
whereby the flows have been constrained to follow new and less menacing
courses.

The surfaces of newly consolidated lava flows are commonly of two contrasted
types, described in English as block and ropy lavas, but known technically by their
Hawaiian names, aa (ah-ah) and pahoehoe respectively. Block lava forms over
partly crystallized flows from which the gases escape in sudden bursts. During
the advance the congealing crust breaks into a wild assemblage of rough,
jagged, scoriaceous blocks. Ropy lava begins at a higher temperature; minute
bubbles of gas escape tranquilly and the flow congeals with a smooth skin which
wrinkles into ropy and corded forms like those assumed by flowing pitch. It
sometimes happens that after the upper surface and edges of a flow of this kind
have solidified, the last of the molten lava drains away, leaving an empty tunnel.
Some of the lava caves of Iceland are famous for the shining black icicles of glass
which adorn their roofs.

When lava of the ropy type flows over the sea floor, or otherwise beneath a
chilling cover of water, it consolidates with a structure like that of a jumbled heap
of pillows, and is then appropriately described as pillow lava. By the time each
emerging tongue of lava has swollen to about the size of a pillow the rapidly
congealed skin prevents further growth. New tongues which then exude through
cracks in the glassy crust similarly swell into pillows and so the process continues.
The structure is a common one in the submarine lavas associated with the
geosynclinal sediments of former periods, and has been seen actively developing in
modern flows that reached the sea floor. Columnar structure develops within the
interior of thick masses of lava which have come to rest and have consolidated
under stagnant conditions. It is especially characteristic of very fine grained
plateau basalts which are relatively free from vesicles.


Figure 7. Traditional indexing example.

[from Jones, K.P. "How do we index: A report of
some ASLIB informatics group activity."
The Journal of Documentation, vol. 39, no. 1,
pp. 1-23, 1983.]


Schank has extended the idea in his conceptual dependency
theory to permit inferences, which pertain to the motives of
the participants and the implicit details of what has
transpired, to be drawn from common actions. He represents
the events underlying verbs as action primitives with a
specified actor and object and a direction of action. For
example, the transfer of possession is one such action
primitive, which Schank calls ATRANS, used to encompass such
verbs as give, take, receive, and sell. By recognizing a verb
to be a type of ATRANS, the meaning of the sentence can be
unravelled.
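
The idea of reducing several verbs to one transfer-of-possession primitive can be sketched as follows. This is a deliberately simplified illustration, not Schank's formalism and not part of the system described here: the class name, the role labels, and the decision to record only source and recipient are all assumptions.

```python
from dataclasses import dataclass

# Assumed table: for each verb, does the grammatical subject give up
# possession or gain it? (Illustrative; not taken from the report.)
ATRANS_VERBS = {
    "give": "subject_is_source",
    "sell": "subject_is_source",
    "take": "subject_is_recipient",
    "receive": "subject_is_recipient",
}


@dataclass(frozen=True)
class Atrans:
    """A transfer of possession of `obj` from `source` to `recipient`."""
    obj: str
    source: str
    recipient: str


def to_atrans(subject, verb, obj, other):
    """Map a (subject, verb, object, other party) tuple to an Atrans event.

    Returns None when the verb is not a known transfer-of-possession verb.
    """
    role = ATRANS_VERBS.get(verb)
    if role is None:
        return None
    if role == "subject_is_source":
        return Atrans(obj=obj, source=subject, recipient=other)
    return Atrans(obj=obj, source=other, recipient=subject)
```

In this sketch, "John gave Mary a book" and "Mary received a book from John" reduce to the same event, which is the sense in which recognizing the primitive helps unravel the meaning of a sentence.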

Under this contract, we have explored the application of
similar knowledge about verbs to the processing of text.
While it is not our objective to achieve an in-depth
understanding, our experiments suggest that it is important
that seed knowledge of the key verbs in a text domain be
provided. Verb knowledge is central to the task of language
understanding at even a shallow level and permits the
formation of inferences which can link terms to the seed
knowledge object hierarchy.

In our system, verb knowledge consists of the root verb
forms along with the likely categories of actors and objects.
As the example in figure 9 shows, this verb knowledge is
represented as a semantic network. When a form of the verb is
encountered in text, the system can then infer that the actor
and object are members of the specified categories. In the
example, from knowledge of the verb fix, the actor is inferred
to be a person and the object to be a device or problem. The
network representation permits creation of a verb hierarchy
with inheritance of the inferential knowledge. Thus, for
example, to add knowledge of the verb repair to the system, it
would be necessary only to specify that it is a type of fix.
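
A verb hierarchy with inherited category constraints might be sketched as below. The dictionary layout, the slot names, and the helper functions are assumptions made for illustration. The sketch also assumes the root form of the verb has already been identified, and that the instrument slot of fix carries the category TOOL, as figure 9 suggests.

```python
# Each verb node stores its parent verb (or None) and any category
# constraints stated directly on it. All names here are illustrative.
VERB_NET = {
    "fix": {
        "type_of": None,
        "actor": {"PERSON"},
        "object": {"DEVICE", "PROBLEM"},
        "instrument": {"TOOL"},
    },
    # "repair" states nothing of its own; it inherits everything from "fix".
    "repair": {"type_of": "fix"},
}


def lookup(verb, slot):
    """Walk up the 'type of' links until a constraint for `slot` is found."""
    while verb in VERB_NET:
        node = VERB_NET[verb]
        if slot in node:
            return node[slot]
        verb = node.get("type_of")
    return set()


def infer(root_verb, actor, obj, instrument=None):
    """Return the candidate categories inferred for each term in a clause."""
    fillers = {"actor": actor, "object": obj, "instrument": instrument}
    return {
        term.upper(): sorted(lookup(root_verb, slot))
        for slot, term in fillers.items()
        if term is not None
    }
```

Calling infer("repair", "foobaz", "framis", "widget") gives the same categories as the verb fix would, because repair is declared only as a type of fix and adds no constraints of its own.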


The foobaz fixed the framis with a widget.

(FOOBAZ type PERSON)

(FRAMIS type PROBLEM) or (FRAMIS type DEVICE)

(WIDGET type TOOL)

The accountant fixed his mistake.

John is fixing the leak.

An electrician could fix the switch with a screwdriver.

Figure 9. Representation of verb knowledge with examples
of application.


4.3 Disambiguation of multiple word meanings

The problem of multiple word senses or ambiguity pervades
all efforts to understand text. A simple word like "type" can
be used in a variety of ways which can be differentiated and
understood only on the basis of context cues or other
knowledge. Conservative estimates suggest that over 30
percent of occurrences of English words are lexically
ambiguous. This problem is thus not resolved by reliance
upon a large lexicon of words. Rather, the solution requires
a greater semantic understanding of the text. This can be
achieved by constraining the word meanings with more extensive
domain knowledge, by mechanisms to track and make use of
context, by logical validation techniques, or by
intervention of a knowledge engineer. We have begun to
conduct preliminary experiments in this area.


4.4 Computerized writing aids

The clarity of technical writing, whether of research
reports, documentation or abstracts of research, is generally
more of a problem than the content. Current efforts to
improve the clarity of technical writing tend to be
ineffective because they rely upon editing or revision by
someone other than the original author or because they use
guidelines which are difficult to apply or which are based on
faulty or imprecise assumptions. Present computerized writing
aids provide an author with limited feedback largely
restricted to statistical analyses or to analyses of
individual sentences in isolation. For example, these systems
can detect words not in a standard vocabulary, compute
sentence lengths, flag use of the passive voice and so forth.
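
The kind of sentence-level feedback such aids provide can be sketched in a few lines. The function name, the length threshold, and the passive-voice test (a form of "be" followed by a word ending in "ed") are assumptions for illustration. The passive test in particular is crude: it misses irregular participles such as "written" and can flag adjectives.

```python
import re

# Forms of "be" used by the crude passive-voice test below (assumed list).
BE_FORMS = {"is", "are", "was", "were", "be", "been", "being"}


def check_sentence(sentence, vocabulary, max_words=25):
    """Report simple statistics and flags for one sentence in isolation."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    return {
        "length": len(words),
        "too_long": len(words) > max_words,
        # Words absent from the supplied standard vocabulary.
        "unknown_words": [w for w in words if w not in vocabulary],
        # A form of "be" immediately followed by an "-ed" word.
        "possible_passive": any(
            first in BE_FORMS and second.endswith("ed")
            for first, second in zip(words, words[1:])
        ),
    }
```

Because each sentence is examined on its own, a check of this kind cannot tell whether a sentence is hard to understand in context, which is the limitation noted above.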

A new approach to enhancing the clarity of technical
writing, which relies upon AI techniques for understanding
natural language, has been recently outlined by Kieras. This
approach uses a computerized system to scan a text and detect
problems in the writing. Correction of the text is left to
the author. The rules used by such a computerized system,
however, would be based on findings from research involving
the psychology of comprehension. Kieras contends that these
findings provide a more suitable set of guidelines for
improving the clarity of writing than the guidelines which are
currently used.

This new approach to developing advanced computerized
writing aids interests us for several reasons. It utilizes a
semantic network representation of the text to track and
detect difficulties in the writing. It is also consistent
with our contention that in-depth text understanding is beyond
the current state of the art of natural language understanding
systems. Like our work on indexing, such a computerized
writing aid would not require an in-depth understanding of the
text because the goal is simply to identify when the
comprehension of the text is difficult. Furthermore, much of
our effort to develop language understanding algorithms for
the identification of index terms and semantic relations could
be carried over to the development of algorithms for detecting
problems in writing and improving its clarity.


4.5 Knowledge-based input assistant

Through our experiments with the form 1498 database and
conversations with the Defense Technical Information Center,
we have identified some overall shortcomings of the 1498
abstracts (some of which were mentioned above) which
contribute to difficulties in indexing and retrieving these
research summaries. We have noted an important connection
between the quality of data input to a database and the
ability to subsequently organize and retrieve this
information. The input problems for the 1498s fall into
several categories: (1) a lack of adequate guidelines or
feedback for effectively completing the form, (2) a failure on
the part of the researcher to perceive the utility of
conscientious completion of the form, (3) inconsistent or poor
usage of language, and (4) inadequate knowledge of related
research.

We believe the quality of the input could be improved by an
on-line knowledge-based assistant which would automatically
identify shortcomings during the input process and provide
knowledge, feedback, or further processing of the text to
rectify these problems. This assistant would be linked to our
information retrieval system and would have, in addition,
knowledge of the 1498 format, of language syntax and
semantics, and of the specific research domains being input.
Knowledge from other abstracts would be retrieved for use in
guiding the input and to inform and, we hope, motivate the
author.


5. CONCLUSIONS AND RECOMMENDATIONS


We have carefully studied the characteristics of English
documents of several different types to determine the most
important problems confronting an effort to automatically
index text by extracting keywords and semantic relations. The
process of creating such knowledge bases was found to vary
according to the particular text domain. We found that this
effort requires an initial kernel of seed knowledge,
particularly knowledge of the specific domain.

Our experiments suggested an approach which makes use of
high-frequency, low-content words supplemented with knowledge
of key verbs and seed knowledge. Under this contract we
developed these concepts by defining the parameters of the
requisite seed knowledge and creating a framework for
representing and making use of verb knowledge. The
feasibility of this approach to automatic indexing was
validated in a direct comparison with human indexing
performance.

We have identified several general problem areas for
further study. Among these are three primary issues: (1) the
problem of domain knowledge and of integrating different
domains with each other and with generic seed knowledge; (2)
use of context for pronoun resolution, word disambiguation,
and identification of knowledge domains; and (3) handling
multiple text references underlying a single semantic relation
from multidocument, multidomain databases and, especially,
deciding the priority of such references.

We also recommend that a knowledge-based interactive aid be
designed and developed to assist an author in preparing the
Research and Technology Work Unit Summary (form DD-1498).